
Dev chunk optimization postprocessveppanel#390

Merged

FerriolCalvet merged 57 commits into dev from dev-chunk-optimization-POSTPROCESSVEPPANEL on Apr 10, 2026

Conversation

@migrau
Member

@migrau migrau commented Nov 14, 2025

[copilot generated]

Performance Optimization: Chunked Processing for Large Panel Annotations

Overview

This PR introduces memory-efficient chunked processing for VEP annotation post-processing, enabling the pipeline to handle arbitrarily large panel annotations without memory constraints.

Changes Summary

✅ Implemented Chunking Optimizations

1. panel_postprocessing_annotation.py - Chunked VEP Output Processing

  • Chunk size: 100,000 lines
  • Implementation: Streaming pandas read with incremental output writing
  • Benefits:
    • Processes large VEP outputs without loading entire file into memory
    • Prevents OOM errors on panels with millions of variants
    • Maintains same output quality with predictable resource usage

Technical details:

import gc
import pandas as pd

chunk_size = 100_000
reader = pd.read_csv(VEP_output_file, sep="\t", chunksize=chunk_size)

for i, chunk in enumerate(reader):
    processed_chunk = process_chunk(chunk, chosen_assembly, using_canonical)
    # Incremental write with the header only on the first chunk
    rich_out_file.write(processed_chunk.to_csv(header=(i == 0), index=False, sep="\t"))
    del processed_chunk
    gc.collect()  # Explicit memory cleanup between chunks

Process: CREATEPANELS:POSTPROCESSVEPPANEL

  • Takes per-chromosome output from VCFANNOTATEPANEL
  • Processes in 100k-line chunks
  • Status: ✅ Working successfully
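
The header-only-on-first-chunk pattern above can be illustrated with a tiny self-contained sketch (toy column names, not the real VEP schema): streaming a TSV through pandas in chunks and writing incrementally reproduces the input without ever holding the whole table in memory.

```python
import io
import pandas as pd

def chunked_passthrough(src_text, out_path, chunk_size=2):
    """Stream a TSV through pandas in chunks, writing the header only once."""
    reader = pd.read_csv(io.StringIO(src_text), sep="\t", chunksize=chunk_size)
    with open(out_path, "w") as out:
        for i, chunk in enumerate(reader):
            # header=(i == 0): emit column names for the first chunk only
            out.write(chunk.to_csv(header=(i == 0), index=False, sep="\t"))

src = "CHROM\tPOS\n" + "".join(f"chr1\t{p}\n" for p in range(5))
chunked_passthrough(src, "out.tsv")
assert open("out.tsv").read() == src  # output is byte-identical to the input
```

In the real script the per-chunk transformation happens between read and write, but the header bookkeeping is the same.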

2. panel_custom_processing.py - Chromosome-Based Chunked Loading

  • Chunk size: 1,000,000 lines
  • Strategy: Load only relevant chromosome data in chunks
  • Benefits:
    • Memory-efficient custom region annotation
    • Filters during read to minimize memory footprint

Technical details:

import pandas as pd

def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={'CHROM': str})
    chr_data = []
    for chunk in reader:
        filtered = chunk[chunk["CHROM"] == chrom]
        if not filtered.empty:
            chr_data.append(filtered)
    return pd.concat(chr_data) if chr_data else pd.DataFrame()

Process: CUSTOMPROCESSING / CUSTOMPROCESSINGRICH

  • Processes custom genomic regions with updated annotations
  • Loads data per-chromosome to reduce memory usage
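
As a toy end-to-end check of the loader (a compact variant of the function is repeated here so the snippet runs on its own; the file name and data are made up):

```python
import pandas as pd

def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    """Load only the rows of one chromosome, scanning the file in chunks."""
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={"CHROM": str})
    parts = [c[c["CHROM"] == chrom] for c in reader]
    parts = [p for p in parts if not p.empty]
    return pd.concat(parts) if parts else pd.DataFrame()

# Toy input: three chromosomes, two positions each.
with open("toy_panel.tsv", "w") as fh:
    fh.write("CHROM\tPOS\n")
    for chrom in ("chr1", "chr2", "chr3"):
        for pos in (100, 200):
            fh.write(f"{chrom}\t{pos}\n")

chr2 = load_chr_data_chunked("toy_panel.tsv", "chr2", chunksize=2)
assert list(chr2["POS"]) == [100, 200]
assert set(chr2["CHROM"]) == {"chr2"}
```

Because the filter is applied per chunk, at most `chunksize` rows of off-target chromosomes are ever resident at once.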

❌ VEP Cache Storage Location - No Performance Impact

What was tested:

  • Using VEP cache from beegfs storage (/workspace/datasets/vep or /data/bbg/datasets/vep)
  • Expected faster cache access vs. downloading on-the-fly

Results:

  • No significant runtime improvement for ENSEMBLVEP_VEP process
  • VEP annotation runtime is compute-bound, not I/O-bound
  • Network-attached storage performed equivalently to local cache
  • OS filesystem caching likely mitigates storage location differences

Commits:

  • 035a0c7 (April 3, 2025): Added VEP cache beegfs support
  • 8e40d83 (April 24, 2025): Removed VEP cache beegfs optimization (no benefit)

Current approach:

  • Cache location configurable via params.vep_cache
  • Defaults to downloading cache if not provided
  • Various config files specify beegfs paths for convenience, not performance

Resource Configuration

Updated resource limits for chunked processes:

withName: '(BBGTOOLS:DEEPCSA:CREATEPANELS:POSTPROCESSVEPPANEL*|...)' {
    cpus   = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time   = { 360.min * task.attempt }
}

Integration Points

Affected Subworkflows:

  • CREATEPANELS:POSTPROCESSVEPPANEL → processes VEP output in chunks
  • CUSTOMPROCESSING / CUSTOMPROCESSINGRICH → uses chunked loading for custom regions

Pipeline Flow:

SITESFROMPOSITIONS → VCFANNOTATEPANEL (VEP) 
    ↓
POSTPROCESSVEPPANEL (chunked processing) ← 100k line chunks
    ↓
CUSTOMPROCESSING (optional, chunked by chromosome)
    ↓
CREATECAPTUREDPANELS / CREATESAMPLEPANELS / CREATECONSENSUSPANELS

Testing

Tested on:

  • Large-scale panels (millions of variants)
  • Multiple configuration profiles (nanoseq, chip, kidney, etc.)

Validation:

  • Output correctness verified (same results as non-chunked version)
  • Memory usage remains stable across panel sizes
  • No OOM errors on large inputs

Performance Impact

| Metric | Before | After |
|---|---|---|
| Memory usage | Unbounded (full file in RAM) | ~4 GB (controlled) |
| Max panel size | Limited by available memory | Unlimited |
| Runtime | Similar | Similar (no regression) |
| Reliability | OOM on large panels | Stable processing |
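
The memory behaviour in the table can be sanity-checked with a rough, self-contained measurement using `tracemalloc` (toy data; absolute numbers will differ from the real pipeline, but the chunked peak should come out well below the full-read peak):

```python
import io
import tracemalloc
import pandas as pd

def peak_mb(fn):
    """Return the peak traced memory (MB) while running fn()."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

data = "CHROM\tPOS\n" + "".join(f"chr1\t{p}\n" for p in range(200_000))

# Full read: the whole DataFrame is materialized at once.
full = peak_mb(lambda: pd.read_csv(io.StringIO(data), sep="\t"))
# Chunked read: only one 10k-row chunk is resident at a time.
chunked = peak_mb(
    lambda: sum(len(c) for c in pd.read_csv(io.StringIO(data), sep="\t", chunksize=10_000))
)
print(f"full={full:.1f} MB, chunked={chunked:.1f} MB")
```

With real multi-million-line panels the gap is far larger, which is what keeps the process inside its 4 GB allocation.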

Migration Notes

No breaking changes. Existing pipelines continue to work with improved memory efficiency.

Related Commits

  • 276152d: Chunking for panel_custom_processing.py
  • 035a0c7: VEP cache beegfs attempt (added)
  • 8e40d83: VEP cache beegfs removal (no performance gain)
  • Various fixes: 1dffd94, 945c129, d243ebc, etc. (resource tuning)

Conclusion

This PR successfully implements memory-efficient chunked processing for panel annotation post-processing, enabling the pipeline to scale to arbitrarily large panels without memory constraints. The VEP cache storage location experiment confirmed that computation, not I/O, is the bottleneck for annotation runtime.

@FerriolCalvet FerriolCalvet linked an issue Jan 5, 2026 that may be closed by this pull request
@FerriolCalvet FerriolCalvet added this to the Phase 2 milestone Jan 5, 2026
@migrau migrau requested a review from FerriolCalvet March 19, 2026 18:54
@m-huertasp
Collaborator

Hi! While checking the cord bloods run (combining DupCaller and deepUMI callings) I saw that one of the bottlenecks is in postprocessveppanel, and I found this PR. I've checked the Python script and I think it could be optimized a bit further by using polars instead of pandas (now that we have it in the deepCSA container) and by changing the "apply" logic in some places. Just a heads-up that it could be done in the future so I don't forget.

Member

@FerriolCalvet FerriolCalvet left a comment


Looks good Miguel!

I left some comments and suggestions.
nothing critical.

  • only some minor fixes to pass the nextflow linting
  • update default chunk_size to 1M so that bigger panels get chunked

Another comment: we might need to be more generous with memory for some steps when running bigger cohorts, but we will see as we start using it.

I would apply the suggestions if you agree and then merge it to dev so that it starts to get tested by all of us and we tune it from there

thanks!!

Comment threads (resolved): nextflow_schema.json, nextflow.config, subworkflows/local/createpanels/main.nf
label 'process_single'

conda "python=3.10.17 bioconda::pybedtools=0.12.0 conda-forge::polars=1.30.0 conda-forge::click=8.2.1 conda-forge::gcc_linux-64=15.1.0 conda-forge::gxx_linux-64=15.1.0"
container 'docker://bbglab/deepcsa_bed:latest'
Member

I think that the recipe of this container is not pushed to https://github.com/bbglab/containers-recipes
If you have it somewhere locally, try to push it so that we have everything centralized there, but go ahead with the merge

@FerriolCalvet
Member

Wait Miguel, we found some weird behaviour in the test run with the cord bloods. I will let you know once we solve it

@migrau
Member Author

migrau commented Mar 22, 2026

The results from omega were not deterministic, e.g.:
[screenshot of the differing omega outputs omitted]

Now, the test rounds numeric columns (dnds, pvalue, lower, upper) to 2 decimals before comparison.
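
A sketch of what such a tolerance check might look like (the column names are taken from the comment above; the actual test harness may differ):

```python
import pandas as pd

# Columns with small run-to-run numeric noise, rounded before comparison.
NUMERIC_COLS = ["dnds", "pvalue", "lower", "upper"]

def frames_match(a: pd.DataFrame, b: pd.DataFrame, decimals: int = 2) -> bool:
    """Compare two result tables, tolerating tiny non-deterministic noise
    by rounding the numeric columns to a fixed number of decimals."""
    a, b = a.copy(), b.copy()
    for col in NUMERIC_COLS:
        a[col] = a[col].round(decimals)
        b[col] = b[col].round(decimals)
    return a.equals(b)

# Two runs that differ only in the 4th decimal place compare equal.
run1 = pd.DataFrame({"dnds": [1.2301], "pvalue": [0.04999], "lower": [0.9], "upper": [1.6]})
run2 = pd.DataFrame({"dnds": [1.2299], "pvalue": [0.05001], "lower": [0.9], "upper": [1.6]})
assert frames_match(run1, run2)
```

Rounding to 2 decimals trades some sensitivity for stability; a genuinely different dnds estimate still fails the comparison.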

On the other hand, the polars implementation in bin/panel_custom_processing.py is much faster than pandas, but it requires roughly 30% more RAM because there is no chunking. Results were tested and compared with the previous implementation and have the same md5sum.
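
The byte-for-byte comparison mentioned here can be done with a streaming MD5, e.g. (hypothetical file names):

```python
import hashlib

def md5sum(path, buf_size=1 << 20):
    """Stream a file through MD5 without loading it fully into memory."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(buf_size), b""):
            h.update(block)
    return h.hexdigest()

# Demo: two byte-identical outputs hash the same.
for name in ("out_a.tsv", "out_b.tsv"):
    with open(name, "w") as fh:
        fh.write("CHROM\tPOS\nchr1\t100\n")
assert md5sum("out_a.tsv") == md5sum("out_b.tsv")
```

In practice one would hash the pandas output and the polars output of the same input panel and compare the digests.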

migrau added 6 commits March 30, 2026 15:34
…g in VEP annotation and adjust related schema defaults. Linting fixes
- Modified nextflow.config to include general reference paths and skip validation for specific parameters.
- Increased resource limits for processes to accommodate VEP execution.
- Changed panel_sites_chunk_size to 0 and disabled parameter validation.
- Added new input_maf.csv file with sample and VCF path data for testing.
@migrau
Member Author

migrau commented Apr 10, 2026

MAF tests from current dev added in 330690e

Test passed. New snapshots match the ones from dev.

$ nf-test test 

🚀 nf-test 0.9.2
https://www.nf-test.com
(c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr


Test DEEPCSA Pipeline

  Test [4150b7be] 'Minimal features test run' PASSED (677.383s)
  Test [9a4541aa] 'MAF-based minimal features test run' PASSED (1019.239s)
  Test [1e61cc82] 'Omega analysis test run' PASSED (651.906s)
  Test [c650506f] 'MAF-based omega analysis test run' PASSED (1006.037s)
  Test [b4f95eff] 'Mutation density test run' PASSED (652.338s)
  Test [89f4b5ca] 'MAF input validation - fails when --input_maf is provided without --use_custom_depths' PASSED (7.543s)


SUCCESS: Executed 6 tests in 4014.461s


@migrau
Member Author

migrau commented Apr 10, 2026

Merge from dev done and tests passed:

$ nf-test test 

🚀 nf-test 0.9.2
https://www.nf-test.com
(c) 2021 - 2024 Lukas Forer and Sebastian Schoenherr

Load .nf-test/plugins/nft-utils/0.0.3/nft-utils-0.0.3.jar

Test DEEPCSA Pipeline

  Test [4150b7be] 'Minimal features test run' PASSED (625.354s)
  Test [9a4541aa] 'MAF-based minimal features test run' PASSED (1050.716s)
  Test [1e61cc82] 'Omega analysis test run' PASSED (665.905s)
  Test [c650506f] 'MAF-based omega analysis test run' PASSED (1057.473s)
  Test [b4f95eff] 'Mutation density test run' PASSED (619.715s)
  Test [89f4b5ca] 'MAF input validation - fails when --input_maf is provided without --use_custom_depths' PASSED (8.228s)


SUCCESS: Executed 6 tests in 4036.851s

We can go ahead with the merge to dev

@FerriolCalvet FerriolCalvet merged commit cfa2a97 into dev Apr 10, 2026